DOI: 10.1201/9781003355205-2
Chapter 2
Mapping of Sequence Reads to the Reference Genomes
2.1 INTRODUCTION TO SEQUENCE MAPPING
So far, we have covered the first two steps of NGS/HTP data analysis:
acquiring the raw data in FASTQ file format and performing read quality control. At this
point, you know that raw sequencing data must be cleaned of errors and artifacts
as much as possible before moving on to the next step of the analysis. This chapter
discusses the alignment of reads (short or long) to a reference genome of an organism. This
step is crucial for most sequencing applications, including reference-guided genome
assembly, variant discovery, gene expression (RNA-Seq), epigenetics (ChIP-Seq, Methyl-Seq),
and metagenomics (targeted and shotgun). The reference genome sequence of an
organism is a key element of read alignment, or mapping. Scientists have devoted an enormous
amount of time and effort to determining the sequences of many organisms. Complete
genomes of hundreds of organisms have already been sequenced and the list continues to
grow. The sequencing of the human genome was completed in 2003 by the Human Genome
Project, an international effort in which the National Human Genome Research Institute
(NHGRI) played a leading role. This was followed by the sequencing of the genomes of a
variety of model organisms that are used as surrogates in studying human biology, and then
by the genomes of numerous other organisms, including some extinct ones such as Neanderthals.
The first sequenced genomes of model organisms include the rat, puffer
fish, fruit fly, sea squirt, roundworm, and the bacterium Escherichia coli. The NHGRI has
sequenced numerous species with the aim of providing data for understanding genetic
variation among organisms. Genome sequences are available in sequence databases funded
by governments and supported by institutions. A reference genome sequence of an organism
is a curated sequence that represents the genome of the individuals of that organism.
However, the genomes of individuals vary, and the reference sequence is only a
standard against which other sequences are compared. These days, there are reference genomes for
thousands of organisms, including animals, plants, fungi, bacteria, archaea, and viruses,